fix(core): use native graceful process tree shutdown#33655
fix(core): use native graceful process tree shutdown#33655leosvelperez wants to merge 12 commits intomasterfrom
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
✅ Deploy Preview for nx-docs ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
|
View your CI Pipeline Execution ↗ for commit 47f266b
☁️ Nx Cloud last updated this comment at |
🐳 We have a release for that!This PR has a release associated with it. You can try it out using this command: npx [email protected] my-workspaceOr just copy this version and use it in your own command: 0.0.0-pr-33655-040845c
To request a new release for this pull request, mention someone from the Nx team or the |
|
Hey @leosvelperez — I've been working on fixing the graceful shutdown regression that's been blocking this PR. I built on top of your Branch: What it doesAdded
Then wired it into all the TypeScript
Test resultsTested with the reproduction repo (3 concurrent services via
What's left
Happy to adjust anything based on your feedback. This has been a fun one to debug! |
|
@leosvelperez had a go at it with opus 4.6 and codex 5.3 xhigh, seems working in my local repro, wonder what you think about the solution |
040845c to
ae26480
Compare
✅ Deploy Preview for nx-dev ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
|
@agcty thanks so much for your continued investigation and work here! I'll take a look at your changes. At a quick glance, it's a subset of what I'm trying to do, but it might provide a quick/intermediate point we could use before I wrap up my changes. So, I'll review/test it locally and analyze whether it's worth using as a quick win. Will let you know later today. This branch was outdated, so I updated it. I apologize for any issues it might cause with your work on top of it. |
|
@agcty did you test it against the I can follow it up and fix whatever is left, but I want to understand what exactly you tested so I have the full picture of what's expected vs what I see. |
|
@leosvelperez I think it's not quite there yet, it seemed like it worked but I think I was on the other branch, sorry about that. This seems very very hard to debug, e.g nx changes env vars when it finds a claude md file amongst some other things which makes it not that easy to reason about. Will look into it some more today! Did you have another fix in mind that be architecturally more sound? |
|
@agcty, yes, I'm working on a bigger change to centralize/coordinate process shutdown requests/flow, but I'll explore your solution a bit more to see if, with some tweaks, we could still have a quicker win. |
|
@leosvelperez I see, thanks for the update! |
|
Failed to publish a PR release of this pull request, triggered by @leosvelperez. |
🐳 We have a release for that!This PR has a release associated with it. You can try it out using this command: npx [email protected] my-workspaceOr just copy this version and use it in your own command: 22.6.0-pr.33655.b3efa09
To request a new release for this pull request, mention someone from the Nx team or the |
df03045 to
4de9081
Compare
Building on the Rust-based process tree killer, this adds a
`killProcessTreeGraceful` async function that implements
SIGTERM → wait → SIGKILL escalation, preventing two issues:
1. Orphaned processes: the tree walker sends signals to all
descendants regardless of process group, so deep child processes
(e.g. doppler → bunx → bun → app) all receive SIGTERM.
2. Premature cleanup interruption: the main process now stays alive
during a configurable grace period (default 5s), keeping the PTY
master open and preventing the kernel from sending SIGHUP to
children before they finish their cleanup handlers.
The fire-and-forget `killProcessTree` is kept for synchronous
`process.on('exit')` handlers as a last resort.
Fixes #32438
Co-Authored-By: Claude Opus 4.6 <[email protected]>
- forked-process-task-runner: await kill promises before process.exit() - running-tasks: per-child SIGINT handlers no longer call process.exit() to prevent first-child-exits race - pseudo-terminal: shutdown() uses sync killProcessTree for exit handler - pseudo-terminal: kill() uses killProcessTreeGraceful directly - process_killer: graceful kill detects reparented descendants Co-Authored-By: Claude Opus 4.6 <[email protected]>
Co-Authored-By: Claude Opus 4.6 <[email protected]>
TaskOrchestrator.cleanup() now awaits the forked process runner's cleanup promise alongside other running tasks, ensuring all process trees are fully shut down before the orchestrator continues. Co-Authored-By: Claude Opus 4.6 <[email protected]>
…reeGraceful Tests cover: - Killing a simple process - Killing a multi-level process tree (parent + children) - Graceful SIGTERM with cleanup handler verification - Slow cleanup within grace period (verifies process isn't killed prematurely) - Multi-level tree graceful shutdown - Already-dead process (no-op fast path) Skipped on Windows where Unix signals don't apply. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Rewrite killProcessTreeGraceful to signal leaf processes first, wait for them to exit, then signal their parents (now leaves). Keeps the PTY shell alive while descendants run cleanup handlers. - Signal only new leaves each iteration (signaled HashSet prevents duplicate SIGTERMs that interrupted execSync cleanup) - Reuse single System instance and Vec buffer across iterations - Defensive SIGKILL on !has_leaves to prevent process leaks - Accept string | number signal type (consistent with killProcessTree) - Add ?. null check in SeriallyRunningTasks.kill() - Add missing signalToCode import in forked-process-task-runner
Rust: - Log JoinError instead of silently swallowing with .ok() - Inline collect_process_tree_inner, drop unused require_root param - Remove duplicate buf copy before ProcessesToUpdate::Some Tests: - Fix SIGTERM test to use single node process (bottom-up kills shell children first, preventing trap from firing) - Fix tree test to poll for marker file instead of silently skipping grandchild assertion - Remove misleading graceful tree test (cleanup marker never written due to bottom-up kill order) - Add SIGKILL fallback test for SIGTERM-ignoring processes - Add afterEach cleanup, remove unused imports, widen timing bounds
- Early-return in collect_process_tree when root PID not found, skipping unnecessary process table iteration - Return children map from collect_process_tree to avoid redundant reconstruction in graceful kill path - Replace duck-typed .then check with Promise.resolve() wrapper in signal handler
…hutdown Orchestrator is now the single authority for signal dispatch in the direct-execution path. Children spawned with detached process groups (setsid) don't receive OS SIGINT; the orchestrator sends SIGTERM via killProcessTreeGraceful during cleanup. - exec() -> spawn(detached) in RunningNodeProcess for process group isolation - Guard per-child signal handlers with NX_FORKED_TASK_EXECUTOR (direct path delegates to orchestrator, forked path keeps handlers as primary mechanism) - Remove FPTR SIGTERM/SIGHUP handlers that consumed events before orchestrator - Remove FPTR SIGINT process.exit() (orchestrator owns exit decision) - Unify orchestrator signal handlers via handleSignal with process.on() to absorb signal-exit v3 re-raises, stopRequested as re-entrancy guard - Deduplicate kills between continuousSnapshot and runningRunCommandsTasks - Add killing flag and closedPromise to RunningNodeProcess for idempotent graceful shutdown with stdio drain - Rust: unified map_signal across platforms, force_kill_fallback for fire-and-forget killProcessTree, cleanup subprocess discovery, ProcessesToUpdate::All for tracking children spawned during shutdown - Windows: detached=false (children share console, get CTRL_C_EVENT), unsupported signals fall back to SIGKILL via force_kill_fallback
cleanUpUnneededContinuousTasks() fires kill() without await. With the boolean `killing` flag, cleanup() at the end of run() would call kill() again but no-op (flag already set), leaving the async kill unwaited. Detached children survive the parent's exit. Replace the boolean with a cached killPromise. Concurrent callers (cleanup after fire-and-forget kill) now await the same in-progress operation.
There was a problem hiding this comment.
Nx Cloud has identified a possible root cause for your failed CI:
We classified these CI failures as environment state rather than a code change. The detox:build-base failure is a TypeScript build dependency ordering issue where the react package output was not available — unrelated to our signal handling and process killer changes. The @nx/nx-source:populate-local-registry-storage failure (exit code 130) is a pre-existing flaky interruption with a 3% historical failure rate; a rerun should resolve both.
No code changes were suggested for this issue.
Trigger a rerun:
🔔 Heads up, your workspace has pending recommendations ↗ to auto-apply fixes for similar failures.
🎓 Learn more about Self-Healing CI on nx.dev
Current Behavior
When running continuous tasks (e.g.
nx e2ewith a dev server dependency), shutting down via Ctrl+C or SIGTERM can leave orphaned processes and leaked resources (e.g. Docker containers still running). The JStree-killpackage kills all processes flat — parents die before children can run cleanup handlers. Children also receive redundant signals (OS SIGINT + orchestrator SIGTERM + per-process handlers), causing race conditions and unpredictable shutdown ordering.Expected Behavior
Graceful, ordered shutdown: leaf processes are signaled first, allowed to run cleanup handlers (e.g.
docker compose stop), then parents are signaled. At most 1 orchestrator-dispatched signal per child. No orphaned processes, no leaked resources, cleanup subprocesses are protected during shutdown.Changes
Rust process killer (
process_killer/mod.rs)Two native functions via
sysinfoprocess table, replacing thetree-killnpm package:killProcessTree(sync, fire-and-forget) — snapshots tree, signals all descendants at once. Forprocess.on('exit')handlers. Falls back to SIGKILL on Windows.killProcessTreeGraceful(async, bottom-up) — signals leaves first, waits for exit, promotes parents, repeats. Tracks and protects cleanup subprocesses spawned during shutdown. SIGKILL escalation after grace period (default 5s).exec → spawn with detached groups (
running-tasks.ts)exec()→spawn({ shell: true, detached: true })— children get own process group viasetsid(), don't receive OS SIGINT on Ctrl+C. Only orchestrator dispatches signals.killPromisecaching — concurrent callers (e.g.cleanup()after fire-and-forget kill fromcleanUpUnneededContinuousTasks) await the same in-progress kill instead of no-oping. Prevents orphan detached processes on normal task completion.NX_FORKED_TASK_EXECUTOR— only active in forked path (no orchestrator).Orchestrator (
task-orchestrator.ts)process.on()(persistent) instead ofprocess.once()— prevents signal-exit re-raise before async cleanup completes.handleSignalwithstopRequestedguard for SIGINT/SIGTERM/SIGHUP.Forked process task runner (
forked-process-task-runner.ts)cleanup()awaited by orchestrator.process.exit().Other migrations
pseudo-terminal.ts,node-child-process.ts,run-script.impl.ts: replacetree-killwith native killers.tree-killdependency.Tests
Related Issue(s)
Fixes #32438